QVAC-3697: Load GGUF File From Buffer #1

Open
jesusmb1995 wants to merge 24 commits into base: temp-load-from-buffer from jmb/memory_load_pr

Conversation


@jesusmb1995 commented Jul 30, 2025

This pull request changes llama.cpp so that models can be loaded directly from memory. It is intended to be reviewed commit by commit; each commit contains a longer description below its header.

Tested to work properly from a bare Addon (LLM repo). See #1 (comment)

In particular, this PR exposes:

  • llama-cpp.h:llama_model_load_from_buffer(vector<uint8_t>&& data, ...) to load a model from a single buffer containing the contents of a .gguf file (a usage sketch follows this list).
  • llama.h:llama_model_load_from_split_futures(char** paths, ...) and llama-cpp.h:llama_model_load_fulfill_split_future(char* path, ..., unique_ptr<basic_streambuf<uint8_t>>&& streambuf), which allow a model to be loaded asynchronously/incrementally, uploading its tensors to backend storage while host memory is being released.
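As a rough usage sketch (not part of the PR): loading from a single buffer might look like the following, assuming default model params. The exact signature, return type, and additional parameters of llama_model_load_from_buffer are defined in llama-cpp.h in this PR; everything beyond the buffer argument here is an assumption.

```cpp
// Sketch only: llama_model_default_params()/llama_model_free() are existing
// llama.cpp API; the extra parameters and return type of
// llama_model_load_from_buffer are assumptions based on the description above.
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>

#include "llama-cpp.h"

int main() {
    // In the Addon this buffer would come from a dataloader; here it is simply
    // the whole .gguf file read from disk into memory.
    std::ifstream in("models/qwen3/Qwen3-0.6B-Q8_0.gguf", std::ios::binary);
    std::vector<uint8_t> data((std::istreambuf_iterator<char>(in)),
                               std::istreambuf_iterator<char>());

    llama_model_params params = llama_model_default_params();
    llama_model * model = llama_model_load_from_buffer(std::move(data), params);
    if (model == nullptr) {
        fprintf(stderr, "failed to load model from buffer\n");
        return 1;
    }

    llama_model_free(model);
    return 0;
}
```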

How to run the code?

Build and prepare model

Build llama.cpp (e.g. in Release mode) including the examples, tests, and tools:

cmake -B build -DCMAKE_BUILD_TYPE=Release -DLLAMA_BUILD_TOOLS=ON -DLLAMA_BUILD_TESTS=ON -DLLAMA_BUILD_EXAMPLES=ON -DGGML_VULKAN=ON && cmake --build build

Generate a sharded model and its *.tensor.txt summary file:

./build/bin/llama-gguf-split --split --split-max-size 300M models/qwen3/Qwen3-0.6B-Q8_0.gguf Qwen3-0.6B-Q8_0 &&
 mv Qwen*.* models/qwen3

Automated tests

Run automated tests for a single gguf file:

cd build
export LLAMACPP_TEST_MODELFILE=../models/qwen3/Qwen3-0.6B-Q8_0.gguf
ctest -R ^test-model-load-disk$ --verbose
ctest -R ^test-model-load-memory$ --verbose

Run automated tests for sharded model:

cd build
export LLAMACPP_TEST_MODELFILE=../models/qwen3/Qwen3-0.6B-Q8_0-00001-of-00010.gguf
ctest -R ^test-model-load-disk$ --verbose
ctest -R ^test-model-load-memory-split$ --verbose

Or simply run all tests:

cd build
export LLAMACPP_TEST_MODELFILE=../models/qwen3/Qwen3-0.6B-Q8_0.gguf
ctest

Should output:

...
30/41 Test #30: test-backend-ops ..................   Passed  104.24 sec                                                     
      Start 31: test-model-load-cancel                        
31/41 Test #31: test-model-load-cancel ............   Passed    0.34 sec                                                     
      Start 32: test-model-load-disk                          
32/41 Test #32: test-model-load-disk ..............   Passed    0.43 sec                                                     
      Start 33: test-model-load-memory                        
33/41 Test #33: test-model-load-memory ............   Passed    0.00 sec                                                     
      Start 34: test-model-load-memory-split                  
34/41 Test #34: test-model-load-memory-split ......   Passed    0.67 sec 
...
41/41 Test #41: test-eval-callback ................   Passed    0.84 sec

100% tests passed, 0 tests failed out of 41

Label Time Summary:
curl             =   0.84 sec*proc (1 test)
eval-callback    =   0.84 sec*proc (1 test)
main             = 136.15 sec*proc (35 tests)
model            =   1.79 sec*proc (5 tests)

Examples

Demo video: https://drive.google.com/file/d/1mjqecwJ1LFYUNofr4wIdPFK9IkUxbHZh/view?usp=sharing

Set up the environment:

# Do not export any variable to load from disk
# export LLAMA_EXAMPLE_MEMORY_BUFFER=1
export LLAMA_EXAMPLE_MEMORY_BUFFER_SPLIT=1

# Alternatively point GGUF_PATH at a single .gguf file and set LLAMA_EXAMPLE_MEMORY_BUFFER=1
export GGUF_PATH="models/qwen3/Qwen3-0.6B-Q8_0-00001-of-00010.gguf"

Run example with Qwen3:

/usr/bin/time -v ./build/bin/llama-simple -m "$GGUF_PATH"

Outputs:

...
print_backend_buffers_info: offloading 28 repeating layers to GPU
print_backend_buffers_info: offloading output layer to GPU
print_backend_buffers_info: offloaded 29/29 layers to GPU
print_backend_buffers_info:      Vulkan0 model buffer size =   199.11 MiB
print_backend_buffers_info:  Vulkan_Host model buffer size =   157.65 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    44.65 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    46.78 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    47.84 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    45.71 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    45.71 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    47.83 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    47.84 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    46.78 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    31.89 MiB
llama_context: constructing llama_context
llama_context: n_batch is less than GGML_KQ_MASK_PAD - increasing to 64
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 35
llama_context: n_ctx_per_seq = 35
llama_context: n_batch       = 64
llama_context: n_ubatch      = 64
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (35) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context: Vulkan_Host  output buffer size =     0.58 MiB
create_memory: n_ctx = 64 (padded)
llama_kv_cache_unified: layer   0: dev = Vulkan0
llama_kv_cache_unified: layer   1: dev = Vulkan0
llama_kv_cache_unified: layer   2: dev = Vulkan0
llama_kv_cache_unified: layer   3: dev = Vulkan0
llama_kv_cache_unified: layer   4: dev = Vulkan0
llama_kv_cache_unified: layer   5: dev = Vulkan0
llama_kv_cache_unified: layer   6: dev = Vulkan0
llama_kv_cache_unified: layer   7: dev = Vulkan0
llama_kv_cache_unified: layer   8: dev = Vulkan0
llama_kv_cache_unified: layer   9: dev = Vulkan0
llama_kv_cache_unified: layer  10: dev = Vulkan0
llama_kv_cache_unified: layer  11: dev = Vulkan0
llama_kv_cache_unified: layer  12: dev = Vulkan0
llama_kv_cache_unified: layer  13: dev = Vulkan0
llama_kv_cache_unified: layer  14: dev = Vulkan0
llama_kv_cache_unified: layer  15: dev = Vulkan0
llama_kv_cache_unified: layer  16: dev = Vulkan0
llama_kv_cache_unified: layer  17: dev = Vulkan0
llama_kv_cache_unified: layer  18: dev = Vulkan0
llama_kv_cache_unified: layer  19: dev = Vulkan0
llama_kv_cache_unified: layer  20: dev = Vulkan0
llama_kv_cache_unified: layer  21: dev = Vulkan0
llama_kv_cache_unified: layer  22: dev = Vulkan0
llama_kv_cache_unified: layer  23: dev = Vulkan0
llama_kv_cache_unified: layer  24: dev = Vulkan0
llama_kv_cache_unified: layer  25: dev = Vulkan0
llama_kv_cache_unified: layer  26: dev = Vulkan0
llama_kv_cache_unified: layer  27: dev = Vulkan0
llama_kv_cache_unified:    Vulkan0 KV buffer size =     7.00 MiB
llama_kv_cache_unified: size =    7.00 MiB (    64 cells,  28 layers,  1 seqs), K (f16):    3.50 MiB, V (f16):    3.50 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 64, n_seqs = 1, n_outputs = 0
graph_reserve: reserving a graph for ubatch with n_tokens =   64, n_seqs =  1, n_outputs =   64
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =   64, n_seqs =  1, n_outputs =   64
llama_context:    Vulkan0 compute buffer size =    37.34 MiB
llama_context: Vulkan_Host compute buffer size =     0.27 MiB
llama_context: graph nodes  = 1126
llama_context: graph splits = 2
Hello my name is Emily. I'm a student in the 10th grade. I'm interested in studying in the field of mathematics. I want to know how to study
main: decoded 32 tokens in 0.18 s, speed: 174.70 t/s

llama_perf_sampler_print:    sampling time =       2.62 ms /    32 runs   (    0.08 ms per token, 12195.12 tokens per second)
llama_perf_context_print:        load time =     402.14 ms
llama_perf_context_print: prompt eval time =      10.13 ms /     4 tokens (    2.53 ms per token,   394.91 tokens per second)
llama_perf_context_print:        eval time =     166.08 ms /    31 runs   (    5.36 ms per token,   186.65 tokens per second)
llama_perf_context_print:       total time =     575.19 ms /    35 tokens

	Command being timed: "./build/bin/llama-simple -m models/qwen3/Qwen3-0.6B-Q8_0-00001-of-00010.gguf"
	User time (seconds): 0.37
	System time (seconds): 0.44
	Percent of CPU this job got: 88%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.93
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 1101056
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 225849
	Voluntary context switches: 796
	Involuntary context switches: 15
	Swaps: 0
	File system inputs: 0
	File system outputs: 32
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Run example with GTE:

# GGUF_PATH points to gte-large.Q2_K-00001-of-00003.gguf, for example.
/usr/bin/time -v ./build/bin/llama-embedding --model "$GGUF_PATH" --ngl 999

Related PRs


Asana task: https://app.asana.com/1/45238840754660/project/1210873391319186/task/1210877463428607


Convert llama_file to a pure virtual class that can be overridden by multiple implementations (disk, single memory buffer, ...); a rough sketch of the resulting interface is shown after these commit notes.
Define a new macro LLAMA_LOG_CMAKE_DEBUG that becomes a no-op in release builds. This provides good tracing and debugging capabilities, which will be especially useful for the asynchronous loading of multiple model shards.
Add an additional automated test that loads from disk, to ensure the existing functionality does not break.
The gguf-split utility now generates a `.txt` file listing all tensors. This is useful both for manual inspection/debugging and for incremental tensor loading, where it is not possible to know which tensors are present in other split files (this information is critical for handling optional tensors).
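For orientation, a rough sketch of what such a pure virtual llama_file interface could look like. The method set and names here are assumptions based on what a GGUF loader needs; the actual interface is defined in the commit.

```cpp
// Illustrative sketch only: the real interface lives in the commit; the exact
// methods and their signatures are assumptions.
#include <cstddef>
#include <cstdint>

struct llama_file {
    virtual ~llama_file() = default;

    virtual size_t   size() const = 0;                     // total size in bytes
    virtual size_t   tell() const = 0;                     // current read position
    virtual void     seek(size_t offset, int whence) = 0;  // move the read position
    virtual void     read_raw(void * dst, size_t len) = 0; // copy len bytes into dst
    virtual uint32_t read_u32() = 0;                       // convenience helper used by the loader
};

// Concrete implementations then cover the different sources, e.g.:
//   - a disk-backed file (FILE*/mmap, as before)
//   - a single in-memory buffer (std::vector<uint8_t>)
```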
@jesusmb1995 marked this pull request as draft July 30, 2025 18:24
@jesusmb1995 (Author)

I seem to lack permissions to add reviewers. The PR stays in draft until I test it on a bare Addon, but review of the llama.cpp C++ code can start: @olyasir @olek-tether @gianni-cor @chetasr @yuranich @jpgaribotti

@jesusmb1995 force-pushed the jmb/memory_load_pr branch 2 times, most recently from 02227e3 to 0718c30 on July 30, 2025 20:16
@jesusmb1995 (Author)

Updated tests to automatically skip based on the gguf filename (sharded or not) when running all tests at once.

@jesusmb1995 force-pushed the jmb/memory_load_pr branch 2 times, most recently from 5df4e25 to 52ed642 on July 30, 2025 20:49
@jesusmb1995 self-assigned this Aug 14, 2025
@jesusmb1995 marked this pull request as ready for review August 14, 2025 15:08
@jesusmb1995 (Author)

jesusmb1995 commented Aug 14, 2025

Un-drafting, since I was able to run the JS integration test for the Qwen3 LLM Addon without problems. The test can now use any dataloader implementation and will incrementally load the llama.cpp model. See the successful log below.

log_integration.txt

@jesusmb1995 requested a review from chetasr August 14, 2025 15:14
@jpgaribotti

We should not merge to master; it will make maintaining the fork more difficult. For example, we currently have another PR to merge from upstream to bring the fork up to date. We should create a differently named branch for our changes to the fork.

@yuranich

> We should not merge to master; it will make maintaining the fork more difficult. For example, we currently have another PR to merge from upstream to bring the fork up to date. We should create a differently named branch for our changes to the fork.

Can we do the following:

  1. finish updating from upstream
  2. create new branch, merge this fix there
  3. try to contribute back to upstream

Is that something we can do? I also saw there is a multimodal branch; is that something we can consider contributing back? @jpgaribotti

@jesusmb1995 (Author)

jesusmb1995 commented Aug 18, 2025

Fine with me. Please create a tether branch to merge the changes into, @yuranich.

> 3. try to contribute back to upstream
>    is that something we can do?

I have a task in the Asana project to do this, but I don't know how easy it will be given the amount of changes. Maybe we can merge some of the commits.

@jesusmb1995 changed the title from "Load GGUF File From Buffer" to "QVAC-3697: Load GGUF File From Buffer" Aug 18, 2025
@olek-tether self-requested a review August 18, 2025 20:26
@yuranich

> Fine with me. Please create a tether branch to merge the changes into, @yuranich.
>
> > 3. try to contribute back to upstream
> >    is that something we can do?
>
> I have a task in the Asana project to do this, but I don't know how easy it will be given the amount of changes. Maybe we can merge some of the commits.

temp-load-from-buffer created, @jesusmb1995

@jesusmb1995 changed the base branch from master to temp-load-from-buffer August 19, 2025 07:30
@jesusmb1995 (Author)

Force-pushed to attempt to fix CI on some platforms; due to different compilers/configs it was failing on some of them.

@jesusmb1995 force-pushed the jmb/memory_load_pr branch 7 times, most recently from 4277f06 to 4d263be on August 21, 2025 10:44
- Ensures a char traits implementation for uint8_t exists that can be used with std::basic_streambuf.
- Adds an implementation of std::basic_streambuf over a single vector. It will be used by llama.cpp and tests when loading from a single memory buffer (a rough sketch of the idea follows below).
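A minimal sketch of that idea, assuming hypothetical class names; the PR's actual traits class and streambuf implementation live in the commit.

```cpp
// Sketch only: names are illustrative, not the PR's actual identifiers.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <cwchar>     // std::mbstate_t
#include <streambuf>
#include <vector>

// char_traits-style traits so std::basic_streambuf can be instantiated for uint8_t.
struct uint8_traits {
    using char_type  = uint8_t;
    using int_type   = int;
    using off_type   = std::streamoff;
    using pos_type   = std::streampos;
    using state_type = std::mbstate_t;

    static void assign(char_type & a, const char_type & b) { a = b; }
    static bool eq(char_type a, char_type b) { return a == b; }
    static bool lt(char_type a, char_type b) { return a < b; }
    static int  compare(const char_type * a, const char_type * b, size_t n) { return std::memcmp(a, b, n); }
    static size_t length(const char_type * s) { return std::strlen(reinterpret_cast<const char *>(s)); }
    static const char_type * find(const char_type * s, size_t n, const char_type & c) {
        return static_cast<const char_type *>(std::memchr(s, c, n));
    }
    static char_type * move(char_type * d, const char_type * s, size_t n) { return static_cast<char_type *>(std::memmove(d, s, n)); }
    static char_type * copy(char_type * d, const char_type * s, size_t n) { return static_cast<char_type *>(std::memcpy(d, s, n)); }
    static char_type to_char_type(int_type i) { return static_cast<char_type>(i); }
    static int_type  to_int_type(char_type c) { return c; }
    static bool      eq_int_type(int_type a, int_type b) { return a == b; }
    static int_type  eof() { return -1; }
    static int_type  not_eof(int_type i) { return i == eof() ? 0 : i; }
};

// Read-only streambuf exposing one in-memory vector as the get area.
class vector_streambuf : public std::basic_streambuf<uint8_t, uint8_traits> {
public:
    explicit vector_streambuf(std::vector<uint8_t> data) : data_(std::move(data)) {
        uint8_t * base = data_.data();
        setg(base, base, base + data_.size());
    }
private:
    std::vector<uint8_t> data_;
};
```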
Override the pure virtual interface with a class that can operate on a single memory buffer.
Auxiliary function to convert a list of C strings to a vector of C++ strings.
Add new GGUF reader implementation that can read metadata from a memory buffer.
- Add code to load a gguf file from a variant (memory or disk).
- Add some structs that simplify loading a file and keeping track of the related pointers (which are now in the same struct).
Move the loader code that processes a file after it has been loaded into memory and populates the loader's own attributes into a reusable method.
Add a new C++ function to the llama.cpp main header to load from a single memory buffer, and propagate the changes to internal calls/constructors.
A file buffer that can be fulfilled using string keys; the extract method blocks until the file is provided (a rough sketch of this pattern follows below).
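A minimal sketch of that fulfill/extract pattern, assuming hypothetical names and a plain std::streambuf (the PR itself works with basic_streambuf<uint8_t>):

```cpp
// Illustrative sketch of the fulfill/extract pattern described above; names and
// types are assumptions, not the PR's actual classes.
#include <condition_variable>
#include <map>
#include <memory>
#include <mutex>
#include <streambuf>
#include <string>

class split_future_buffer {
public:
    // Producer side: hand over the stream for a given split path.
    void fulfill(const std::string & path, std::unique_ptr<std::streambuf> buf) {
        std::lock_guard<std::mutex> lock(mutex_);
        buffers_[path] = std::move(buf);
        cv_.notify_all();
    }

    // Consumer side: block until the split identified by `path` has been provided.
    std::unique_ptr<std::streambuf> extract(const std::string & path) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return buffers_.count(path) != 0; });
        auto buf = std::move(buffers_[path]);
        buffers_.erase(path);
        return buf;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::map<std::string, std::unique_ptr<std::streambuf>> buffers_;
};
```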
Handles the logic for incrementally loading files and tensors in model shards.
Refactor backend buffer creation (for model loading) into functions.
- The function now takes size_data instead of the member attribute.
- Add sanity checks of file pointer handles.

These two changes will be useful when calling `load_all_data` multiple times during incremental shard loading.
Adapt the loader and model load to incrementally load files and upload tensors.
Add functions to Llama.cpp public headers to asynchronously load shards.
Split out some common loading functionality. This will help with the memory-loading tests.
Add a submodule with reusable code for tests.
Adapt the embedding example to showcase how to load from memory. It can be configured through environment variables.
Adapt the simple example to showcase how to load from memory. It can be configured with environment variables; a sketch of the environment-variable selection follows below.

Qwen3, for example, can be used with the simple example.
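A rough sketch of how the examples presumably pick the load path from the environment variables used earlier in this description; the helper below is hypothetical, the real selection logic lives in the adapted examples.

```cpp
// Hypothetical helper: mirrors the LLAMA_EXAMPLE_MEMORY_BUFFER[_SPLIT] variables
// shown in the "Examples" section; the real selection logic is in the examples.
#include <cstdlib>
#include <cstring>

enum class load_mode { disk, memory_buffer, memory_buffer_split };

static load_mode pick_load_mode() {
    const char * single = std::getenv("LLAMA_EXAMPLE_MEMORY_BUFFER");
    const char * split  = std::getenv("LLAMA_EXAMPLE_MEMORY_BUFFER_SPLIT");
    if (split  != nullptr && std::strcmp(split,  "1") == 0) return load_mode::memory_buffer_split;
    if (single != nullptr && std::strcmp(single, "1") == 0) return load_mode::memory_buffer;
    return load_mode::disk;  // no variable set: keep the original load-from-disk path
}
```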
Add some automated tests that load from memory (single buffer or multiple async splits).
@jesusmb1995 (Author)

Most CI pipelines pass now. Some target failures seem unrelated.

@jesusmb1995 (Author)

> Most CI pipelines pass now. Some target failures seem unrelated.

@jpgaribotti @yuranich Can you suggest what to do with the remaining failing CI pipelines? They seem to be due to unrelated issues, for example:

Run ARTIFACTS_JSON=$(curl -s -L \
Finding latest macos-latest-Release artifact...
No suitable Dawn artifact found!

Is it okay to proceed with the review as it is?
